graph LR
A["10 words"] --> B["Embedding<br/>Model"]
C["1000 words"] --> B
B --> D["Vector<br/>[768 dims]"]
B --> E["Vector<br/>[768 dims]"]
style A fill:#27ae60,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style B fill:#9b59b6,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style E fill:#e74c3c,color:#fff,stroke:#333
Advanced Chunking Strategies for RAG
Comparing fixed-size, recursive, semantic, agentic, and late chunking methods for optimal retrieval quality
Keywords: RAG, chunking, text splitting, recursive chunking, semantic chunking, late chunking, agentic chunking, LlamaIndex, LangChain, embedding, retrieval quality, chunk size, overlap, document-aware splitting

Introduction
Chunking is the single most impactful design decision in a RAG pipeline. Before any embedding model, vector store, or retrieval strategy can do its job, your documents must be sliced into chunks — and how you slice them determines what gets retrieved.
A poor chunking strategy leads to diluted embeddings, mid-sentence breaks, topic mixing, and lost context. A well-chosen strategy preserves semantic boundaries, keeps related information together, and produces embeddings that match user queries accurately.
According to Chroma’s research on evaluating chunking strategies, the choice of chunking strategy can impact recall by up to 9% — the difference between a RAG system that works and one that hallucinates.
This article walks through every major chunking approach — from naive character splitting to LLM-powered agentic chunking and Jina AI’s late chunking — with code examples in LlamaIndex and LangChain, benchmark insights, and practical guidance for production systems.
Why Chunking Matters
The Embedding Bottleneck
Embedding models compress text of any length into a fixed-dimension vector (e.g., 768 or 1536 dimensions). Whether you embed 10 words or 1000 words, the output is the same size. This compression is inherently lossy — larger chunks lose more nuance per token.
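This fixed-size compression is easy to see with mean pooling, one common way embedding models collapse per-token states into a single vector (a toy NumPy sketch with random vectors, not any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # fixed output dimensionality of a hypothetical embedding model

def mean_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Collapse per-token vectors into one fixed-size chunk vector."""
    return token_vectors.mean(axis=0)

short_chunk = rng.normal(size=(10, DIM))    # 10 tokens
long_chunk = rng.normal(size=(1000, DIM))   # 1000 tokens

# Both collapse to the same shape — the longer text is compressed 100x harder
assert mean_pool(short_chunk).shape == (DIM,)
assert mean_pool(long_chunk).shape == (DIM,)
```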
The Retrieval Precision Trade-off
| Chunk Size | Embedding Quality | Retrieval Precision | Context for LLM |
|---|---|---|---|
| Too small | Sharp, focused | High precision, low recall | May lack context |
| Optimal | Balanced | Good precision and recall | Sufficient context |
| Too large | Diluted, coarse | Low precision, high recall | May contain noise |
The goal: chunks that are small enough to be semantically focused but large enough to preserve context.
Key Factors in Chunk Design
- Embedding model context window — Hard upper limit (typically 512–8192 tokens)
- Semantic coherence — Each chunk should represent one idea or topic
- Retrieval granularity — Smaller chunks = more precise retrieval
- LLM context budget — How much of the context window you allocate to retrieved chunks
- Document structure — Headers, tables, lists, code blocks have natural boundaries
Strategy 1: Fixed-Size (Character / Token) Splitting
The simplest approach: split text into chunks of exactly N characters or tokens, with optional overlap.
How It Works
graph LR
A["Full Document"] --> B["Chunk 1<br/>(0–500)"]
A --> C["Chunk 2<br/>(400–900)"]
A --> D["Chunk 3<br/>(800–1300)"]
A --> E["..."]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#f5a623,color:#fff,stroke:#333
style E fill:#ccc,color:#333,stroke:#333
LangChain
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter
# Character-based splitting
char_splitter = CharacterTextSplitter(
chunk_size=500,
chunk_overlap=100,
separator="" # Split on any character boundary
)
# Token-based splitting (more precise)
token_splitter = TokenTextSplitter(
chunk_size=256,
chunk_overlap=50,
encoding_name="cl100k_base" # GPT-4 tokenizer
)
chunks = token_splitter.split_text(document_text)
LlamaIndex
from llama_index.core.node_parser import TokenTextSplitter
splitter = TokenTextSplitter(
chunk_size=256,
chunk_overlap=50,
)
nodes = splitter.get_nodes_from_documents(documents)
When to Use
- Quick prototyping where chunking quality is not critical
- Uniform-length documents with no structural hierarchy
- Baseline comparison against smarter strategies
Limitations
- Breaks sentences mid-word or mid-thought
- Ignores document structure (headers, paragraphs, tables)
- Mixes unrelated topics within a single chunk
- Chroma’s evaluation shows TokenTextSplitter at 800 tokens with 400 overlap scored lowest across all metrics
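The mid-word problem is easy to reproduce with a few lines of plain Python (an illustrative toy splitter, not LangChain’s implementation):

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Naive fixed-size splitter: cuts every `size` characters, blind to words."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "Transformers use self-attention to weigh every token against every other."
chunks = fixed_size_chunks(text, size=20)
# Cuts land wherever the counter says, including mid-word:
print(chunks[0])  # 'Transformers use sel'
```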
Strategy 2: Recursive Character Splitting
The most popular chunking method in practice. It splits text using an ordered list of separators, trying the largest structural boundaries first and falling back to smaller ones.
How It Works
graph TD
A["Full Document"] --> B{"Split by \\n\\n<br/>(paragraphs)"}
B -->|Chunk > max| C{"Split by \\n<br/>(newlines)"}
B -->|Chunk ≤ max| D["Done ✓"]
C -->|Chunk > max| E{"Split by .<br/>(sentences)"}
C -->|Chunk ≤ max| D
E -->|Chunk > max| F{"Split by space"}
E -->|Chunk ≤ max| D
F --> D
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style E fill:#9b59b6,color:#fff,stroke:#333
style F fill:#e67e22,color:#fff,stroke:#333
LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=100,
separators=["\n\n", "\n", ".", "?", "!", " ", ""],
length_function=len,
)
chunks = splitter.split_text(document_text)
LlamaIndex
from llama_index.core.node_parser import SentenceSplitter
# SentenceSplitter is LlamaIndex's equivalent — it respects
# sentence boundaries while targeting a chunk size
splitter = SentenceSplitter(
chunk_size=512,
chunk_overlap=50,
)
nodes = splitter.get_nodes_from_documents(documents)
Benchmark Results
Chroma’s evaluation found that RecursiveCharacterTextSplitter with chunk size 200, no overlap consistently performs well across metrics:
| Configuration | Recall | IoU | Precision_Ω |
|---|---|---|---|
| Recursive (200, no overlap) | 88.1% | 7.0 | 29.9 |
| Recursive (400, 200 overlap) | 88.1% | 3.3 | 13.9 |
| TokenText (800, 400 overlap) | 87.9% | 1.4 | 4.7 |
Key insight: smaller chunks with no overlap match larger overlapping configurations on recall while far outperforming them on token efficiency (IoU) and precision.
Separator Choice Matters
The default LangChain separators ["\n\n", "\n", " ", ""] often produce very short chunks. Chroma’s research recommends adding sentence-ending punctuation:
# Better separators for RecursiveCharacterTextSplitter
separators = ["\n\n", "\n", ".", "?", "!", " ", ""]
When to Use
- General-purpose RAG — best default choice
- Text-heavy documents (articles, reports, books)
- You want good results without embedding-model dependency
Strategy 3: Document-Aware (Structural) Splitting
Leverages document structure — markdown headers, HTML tags, code blocks — to create chunks that align with the author’s intended organization.
Markdown Header Splitting
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(markdown_text)
# Each chunk has metadata: {"Header 1": "...", "Header 2": "..."}
HTML Header Splitting
from langchain.text_splitter import HTMLHeaderTextSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
splitter = HTMLHeaderTextSplitter(
headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(html_text)
Two-Stage Splitting
In practice, structural splitting produces chunks of highly variable size. Combine it with recursive splitting for consistent chunk sizes:
from langchain.text_splitter import (
MarkdownHeaderTextSplitter,
RecursiveCharacterTextSplitter,
)
# Stage 1: Split by structure
md_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[("#", "H1"), ("##", "H2"), ("###", "H3")]
)
structural_chunks = md_splitter.split_text(markdown_text)
# Stage 2: Enforce size limits
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
)
final_chunks = text_splitter.split_documents(structural_chunks)
LlamaIndex — MarkdownNodeParser
from llama_index.core.node_parser import MarkdownNodeParser
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(documents)
# Nodes automatically capture header hierarchy as metadata
When to Use
- Well-structured documents (technical docs, wikis, README files)
- Multi-format ingestion where you need to preserve hierarchy
- When metadata enrichment (section titles) improves retrieval
Strategy 4: Semantic Chunking
Instead of relying on character positions or structural markers, semantic chunking uses embedding similarity to detect topic boundaries.
How It Works
- Split text into sentences
- Embed each sentence (or sliding window of sentences)
- Compute cosine similarity between consecutive sentence embeddings
- Detect breakpoints where similarity drops sharply
- Group consecutive similar sentences into chunks
graph TD
A["Sentences"] --> B["Embed each<br/>sentence"]
B --> C["Compute pairwise<br/>cosine similarity"]
C --> D{"Similarity<br/>drop > threshold?"}
D -->|Yes| E["Split here ✂️"]
D -->|No| F["Continue<br/>grouping"]
E --> G["Chunk boundaries<br/>aligned to topics"]
F --> G
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#9b59b6,color:#fff,stroke:#333
style C fill:#e67e22,color:#fff,stroke:#333
style D fill:#e74c3c,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#f5a623,color:#fff,stroke:#333
style G fill:#1abc9c,color:#fff,stroke:#333
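The five steps above reduce to a few lines. This sketch uses hand-made 2-D “embeddings” and an absolute similarity threshold, whereas the library implementations below use real embedding models and a percentile threshold:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_breakpoints(sentence_embeddings: list[np.ndarray],
                         threshold: float = 0.5) -> list[int]:
    """Return indices i where a chunk boundary falls after sentence i."""
    breaks = []
    for i in range(len(sentence_embeddings) - 1):
        if cosine(sentence_embeddings[i], sentence_embeddings[i + 1]) < threshold:
            breaks.append(i)
    return breaks

# Toy embeddings: sentences 0-1 share one topic, sentences 2-3 another
embs = [np.array([1.0, 0.1]), np.array([0.9, 0.2]),
        np.array([0.1, 1.0]), np.array([0.2, 0.9])]
print(semantic_breakpoints(embs))  # [1] → split after sentence 1
```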
LangChain — SemanticChunker
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chunker = SemanticChunker(
embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95, # Split at top 5% similarity drops
)
chunks = chunker.split_text(document_text)
LlamaIndex — SemanticSplitterNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
splitter = SemanticSplitterNodeParser(
buffer_size=1, # Sentences in sliding window
breakpoint_percentile_threshold=95,
embed_model=embed_model,
)
nodes = splitter.get_nodes_from_documents(documents)
Cluster Semantic Chunking
Chroma proposed a more sophisticated variant: the ClusterSemanticChunker. Instead of greedily splitting at local breakpoints, it uses dynamic programming to globally maximize intra-chunk cosine similarity:
| Method | Recall | IoU | Precision_Ω |
|---|---|---|---|
| Kamradt Semantic (default) | 83.6% | 1.5 | 7.4 |
| Kamradt Modified (300 tokens) | 87.1% | 2.1 | 10.5 |
| Cluster Semantic (400 tokens) | 91.3% | 4.5 | 20.7 |
| Cluster Semantic (200 tokens) | 87.3% | 8.0 | 34.0 |
The Cluster Semantic Chunker at 400 tokens achieved the second highest recall (91.3%) while maintaining strong precision.
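Chroma’s exact implementation differs, but the global-optimization idea — dynamic programming over a sentence-similarity matrix instead of greedy local splits — can be sketched as follows (toy similarity matrix, hypothetical reward function):

```python
import numpy as np

def dp_chunking(sim: np.ndarray, max_len: int) -> list[tuple[int, int]]:
    """Partition sentences 0..n-1 into contiguous chunks (start, end), globally
    maximizing summed intra-chunk pairwise similarity, with a max chunk length."""
    n = sim.shape[0]

    def reward(i: int, j: int) -> float:
        # Sum of pairwise similarities inside the chunk of sentences i..j
        block = sim[i:j + 1, i:j + 1]
        return float((block.sum() - np.trace(block)) / 2)

    best = [0.0] * (n + 1)   # best[j] = best total score for the first j sentences
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        best[j], cut[j] = max(
            (best[i] + reward(i, j - 1), i)
            for i in range(max(0, j - max_len), j)
        )
    # Walk back through the cut points to recover chunk spans
    spans, j = [], n
    while j > 0:
        spans.append((cut[j], j - 1))
        j = cut[j]
    return spans[::-1]

# Toy data: sentences 0-1 are mutually similar, as are 2-3
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.9],
                [0.1, 0.1, 0.9, 1.0]])
print(dp_chunking(sim, max_len=2))  # [(0, 1), (2, 3)]
```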
Trade-offs
Advantages:
- Chunks align with actual topic boundaries
- Produces semantically coherent units
- Works across document types without structural markers
Disadvantages:
- Requires calling an embedding model during chunking (cost + latency)
- Chunk sizes are variable and hard to control
- Default Kamradt semantic chunking can produce oversized chunks
- Embedding model quality directly affects chunk quality
When to Use
- Heterogeneous corpora where documents lack consistent structure
- Topic-dense documents where paragraphs blend multiple subjects
- You can afford the embedding cost during ingestion
Strategy 5: Parent-Child (Hierarchical) Chunking
A retrieval-time strategy that decouples what you search on from what you pass to the LLM. Small chunks (children) are used for precise embedding search; when a child matches, its larger parent chunk is sent to the LLM for richer context.
How It Works
graph TD
A["Document"] --> B["Parent Chunk<br/>(512 tokens)"]
B --> C["Child 1<br/>(128 tokens)"]
B --> D["Child 2<br/>(128 tokens)"]
B --> E["Child 3<br/>(128 tokens)"]
B --> F["Child 4<br/>(128 tokens)"]
G["Query"] --> H["Search children"]
H --> D
D --> I["Return parent<br/>for LLM context"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style E fill:#f5a623,color:#fff,stroke:#333
style F fill:#f5a623,color:#fff,stroke:#333
style G fill:#9b59b6,color:#fff,stroke:#333
style H fill:#e67e22,color:#fff,stroke:#333
style I fill:#1abc9c,color:#fff,stroke:#333
LlamaIndex — Auto Merging Retriever
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core import StorageContext, VectorStoreIndex
# Create hierarchical nodes
node_parser = HierarchicalNodeParser.from_defaults(
chunk_sizes=[512, 256, 128] # Parent -> child -> grandchild
)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
# Build index on leaf nodes only
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
# AutoMergingRetriever returns parent when enough children match
retriever = AutoMergingRetriever(
index.as_retriever(similarity_top_k=12),
storage_context,
simple_ratio_thresh=0.3, # Merge if 30%+ of children match
)
LangChain — ParentDocumentRetriever
from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# Child splitter (small chunks for search)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Parent splitter (larger chunks for LLM context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=800)
vectorstore = Chroma(
collection_name="parent_child",
embedding_function=OpenAIEmbeddings(),
)
docstore = InMemoryStore()
retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=docstore,
child_splitter=child_splitter,
parent_splitter=parent_splitter,
)
retriever.add_documents(documents)
results = retriever.invoke("What is the attention mechanism?")
When to Use
- You need precise search but rich context for the LLM
- Documents have varying granularity of information
- You want to avoid the precision-context trade-off entirely
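Stripped of framework machinery, both retrievers above implement the same small lookup: search child vectors, map hits to parents, deduplicate (a sketch with toy vectors and hypothetical IDs):

```python
import numpy as np

def parent_child_retrieve(query_vec, children, parents, child_to_parent, top_k=2):
    """Search small child vectors; return deduplicated parent texts.
    children: {child_id: np.ndarray}; parents: {parent_id: str}."""
    scored = sorted(children, key=lambda cid: float(query_vec @ children[cid]),
                    reverse=True)
    parent_ids = []
    for cid in scored[:top_k]:
        pid = child_to_parent[cid]
        if pid not in parent_ids:   # sibling children share a parent → dedupe
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids]

# Toy index: two parents, two children each (vectors stand in for embeddings)
parents = {"p1": "Parent: full section on attention.",
           "p2": "Parent: full section on RNNs."}
children = {"c1": np.array([1.0, 0.0]), "c2": np.array([0.9, 0.1]),
            "c3": np.array([0.0, 1.0]), "c4": np.array([0.1, 0.9])}
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2", "c4": "p2"}

print(parent_child_retrieve(np.array([1.0, 0.0]),
                            children, parents, child_to_parent))
# ['Parent: full section on attention.']
```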
Strategy 6: Agentic (LLM-Powered) Chunking
Uses an LLM to decide where to split the document. The LLM reads the text and identifies natural breakpoints based on semantic understanding.
How It Works
- Pre-split the document into small fixed-size pieces (e.g., 50 tokens each)
- Present the pieces to an LLM with tagged boundaries
- Ask the LLM to return which boundaries should be split points
- Merge pieces according to the LLM’s decisions
Implementation
from openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
client = OpenAI()
def agentic_chunk(text: str, model: str = "gpt-4o-mini") -> list[str]:
    # Step 1: Pre-split into small pieces
    splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
    pieces = splitter.split_text(text)
    # Step 2: Tag pieces with boundaries
    tagged = ""
    for i, piece in enumerate(pieces):
        tagged += f"<start_chunk_{i}>{piece}<end_chunk_{i}>"
    # Step 3: Ask LLM to identify split points
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "system",
            "content": (
                "You are a document chunker. Given tagged text pieces, "
                "identify where to split the document into semantically "
                "coherent chunks. Return ONLY the piece indices to split "
                "after, as comma-separated numbers. "
                "Example: split_after: 3, 7, 12"
            )
        }, {
            "role": "user",
            "content": tagged
        }],
        temperature=0,
    )
    # Step 4: Parse split points and merge
    split_text = response.choices[0].message.content
    split_indices = [
        int(x.strip())
        for x in split_text.replace("split_after:", "").split(",")
        if x.strip().isdigit()
    ]
    chunks = []
    current = []
    for i, piece in enumerate(pieces):
        current.append(piece)
        if i in split_indices:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
Benchmark Results
Chroma’s evaluation tested LLM-based chunking with GPT-4o:
| Method | Recall | IoU | Precision_Ω |
|---|---|---|---|
| LLM Chunker (GPT-4o) | 91.9% | 3.9 | 19.9 |
| Cluster Semantic (400) | 91.3% | 4.5 | 20.7 |
| Recursive (200, no overlap) | 88.1% | 7.0 | 29.9 |
The LLM Chunker achieved the highest recall (91.9%) in Chroma’s evaluation, confirming that LLMs are capable chunkers.
Trade-offs
Advantages:
- Best semantic understanding of content boundaries
- Adapts to any document type or domain
- Can handle complex structures (tables, mixed formats)
Disadvantages:
- Expensive — requires LLM inference during ingestion
- Slow — orders of magnitude slower than heuristic methods
- Non-deterministic — same document may chunk differently
- Results depend on model quality and prompt engineering
When to Use
- High-value, low-volume document sets where quality is paramount
- Complex or unusual document formats that heuristics can’t handle
- You have the compute budget for LLM-based ingestion
Strategy 7: Late Chunking
A fundamentally different approach proposed by Jina AI (Günther et al., 2024). Instead of chunking before embedding, late chunking applies the transformer model first, then chunks the token embeddings after.
How Traditional vs. Late Chunking Works
graph TB
subgraph Traditional["Traditional Chunking"]
A1["Document"] --> A2["Chunk 1"]
A1 --> A3["Chunk 2"]
A1 --> A4["Chunk 3"]
A2 --> A5["Embed"]
A3 --> A6["Embed"]
A4 --> A7["Embed"]
A5 --> A8["Vec 1"]
A6 --> A9["Vec 2"]
A7 --> A10["Vec 3"]
end
subgraph Late["Late Chunking"]
B1["Document"] --> B2["Full Transformer<br/>Pass"]
B2 --> B3["Token Embeddings<br/>(with full context)"]
B3 --> B4["Chunk 1<br/>Mean Pool"]
B3 --> B5["Chunk 2<br/>Mean Pool"]
B3 --> B6["Chunk 3<br/>Mean Pool"]
B4 --> B7["Vec 1"]
B5 --> B8["Vec 2"]
B6 --> B9["Vec 3"]
end
style A1 fill:#e74c3c,color:#fff,stroke:#333
style B1 fill:#27ae60,color:#fff,stroke:#333
style A8 fill:#e74c3c,color:#fff,stroke:#333
style A9 fill:#e74c3c,color:#fff,stroke:#333
style A10 fill:#e74c3c,color:#fff,stroke:#333
style B7 fill:#27ae60,color:#fff,stroke:#333
style B8 fill:#27ae60,color:#fff,stroke:#333
style B9 fill:#27ae60,color:#fff,stroke:#333
Traditional ~~~ Late
style Traditional fill:#F2F2F2,stroke:#D9D9D9
style Late fill:#F2F2F2,stroke:#D9D9D9
The Key Insight
In traditional chunking, each chunk is embedded in isolation — losing references to other parts of the document. When a chunk says “this approach outperforms the baseline”, the embedding doesn’t know what “this approach” or “the baseline” refers to.
Late chunking runs the entire document through the transformer first. Every token’s embedding captures full document context via the attention mechanism. Only then are token embeddings grouped into chunks and mean-pooled into chunk vectors. The result: chunk embeddings that retain cross-chunk context.
Implementation with Jina AI
import requests
# Using Jina AI's API with late chunking
response = requests.post(
"https://api.jina.ai/v1/embeddings",
headers={"Authorization": "Bearer YOUR_API_KEY"},
json={
"model": "jina-embeddings-v3",
"input": ["Your full document text here..."],
"late_chunking": True
}
)
# Returns chunk embeddings with full document context
embeddings = response.json()["data"]
Manual Late Chunking Concept
For long-context embedding models that expose token-level embeddings:
import torch
import numpy as np
def late_chunking(
    token_embeddings: torch.Tensor,       # (seq_len, hidden_dim)
    chunk_spans: list[tuple[int, int]],   # [(start, end), ...]
) -> list[np.ndarray]:
    """Apply mean pooling per chunk span over contextualized token embeddings."""
    chunk_vectors = []
    for start, end in chunk_spans:
        chunk_tokens = token_embeddings[start:end]
        chunk_vec = chunk_tokens.mean(dim=0).detach().numpy()
        chunk_vectors.append(chunk_vec)
    return chunk_vectors
When to Use
- You use a long-context embedding model (e.g., Jina Embeddings v3)
- Documents have heavy cross-references and coreferences
- You want chunk embeddings that understand document-level context
- The embedding model’s context window can fit your documents
Limitations
- Requires long-context embedding models (not all models support this)
- Document must fit within the model’s context window
- Currently best supported through Jina AI’s API
- Cannot be applied retroactively to existing embeddings
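The effect can be demonstrated numerically with a toy “contextualizer” standing in for a transformer pass — each output token mixes in document-wide information, so late pooling produces different chunk vectors than pooling chunks contextualized in isolation:

```python
import numpy as np

rng = np.random.default_rng(1)
doc_tokens = rng.normal(size=(9, 4))   # 9 raw token vectors, dim 4
spans = [(0, 3), (3, 6), (6, 9)]       # three chunks of three tokens

def contextualize(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a transformer pass: each output token mixes in the
    mean of *all* input tokens (a crude global 'attention')."""
    return 0.5 * tokens + 0.5 * tokens.mean(axis=0)

# Traditional: contextualize each chunk in isolation, then mean-pool
trad = [contextualize(doc_tokens[s:e]).mean(axis=0) for s, e in spans]
# Late: contextualize the whole document once, then mean-pool per span
late = [contextualize(doc_tokens)[s:e].mean(axis=0) for s, e in spans]

# Late chunk vectors carry document-level context the traditional ones lack
assert not np.allclose(trad[0], late[0])
```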
Strategy 8: Contextual Retrieval (Chunk + Context Header)
Introduced by Anthropic, this approach doesn’t change how you chunk — it enriches each chunk with a context header generated by an LLM that summarizes where the chunk fits within the whole document.
How It Works
- Chunk the document using any strategy
- For each chunk, prompt an LLM with the full document + chunk
- The LLM generates a short context header (2–3 sentences)
- Prepend the header to the chunk before embedding
Implementation
from openai import OpenAI
client = OpenAI()
def add_context_header(
    full_document: str,
    chunk: str,
    model: str = "gpt-4o-mini",
) -> str:
    """Generate a context header for a chunk using the full document."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                f"<document>\n{full_document}\n</document>\n"
                f"<chunk>\n{chunk}\n</chunk>\n\n"
                "Give a short succinct context to situate this chunk within "
                "the overall document for the purposes of improving search "
                "retrieval of the chunk. Answer only with the succinct context "
                "and nothing else."
            )
        }],
        temperature=0,
        max_tokens=150,
    )
    context = response.choices[0].message.content
    return f"{context}\n\n{chunk}"

# Apply to all chunks
enriched_chunks = [
    add_context_header(full_doc, chunk) for chunk in chunks
]
When to Use
- Chunks frequently lose context (pronouns, relative references)
- You can afford LLM calls per chunk during ingestion
- Pairs well with any chunking strategy as a post-processing step
Comparison: All Strategies at a Glance
| Strategy | Semantic Awareness | Speed | Cost | Chunk Size Control | Best For |
|---|---|---|---|---|---|
| Fixed-size | None | ⚡ Fastest | Free | Exact | Prototyping |
| Recursive | Low (separators) | ⚡ Fast | Free | Good | General-purpose RAG |
| Document-aware | Medium (structure) | ⚡ Fast | Free | Variable | Structured docs |
| Semantic | High (embeddings) | 🐢 Medium | $ Embedding | Variable | Topic-dense docs |
| Parent-child | Low–Medium | ⚡ Fast | Free | Two-level | Precision + context |
| Agentic (LLM) | Highest | 🐌 Slow | $$$ LLM | Variable | High-value docs |
| Late chunking | High (contextual) | 🐢 Medium | $ Embedding | Good | Cross-referenced docs |
| Contextual | High (post-hoc) | 🐌 Slow | $$$ LLM | Any | Context-poor chunks |
Practical Recommendations
Default Starting Point
For most RAG systems, start with RecursiveCharacterTextSplitter:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # ~200 tokens
chunk_overlap=0, # Overlap often hurts more than it helps
separators=["\n\n", "\n", ".", "?", "!", " ", ""],
)
Chroma’s research confirms this produces competitive results without any embedding cost.
Chunk Size Guidelines
| Document Type | Recommended Chunk Size | Strategy |
|---|---|---|
| Technical docs | 200–400 tokens | Recursive + MarkdownHeaders |
| Legal / financial | 300–500 tokens | Document-aware + parent-child |
| Chat logs / transcripts | 150–250 tokens | Semantic |
| Knowledge base articles | 200–400 tokens | Recursive |
| Code repositories | Per function/class | Document-aware (AST-based) |
Decision Flowchart
graph TD
A["Start"] --> B{"Documents<br/>well-structured?"}
B -->|Yes| C["Document-aware<br/>+ Recursive fallback"]
B -->|No| D{"Budget for<br/>embedding calls?"}
D -->|Yes| E{"Need cross-chunk<br/>context?"}
D -->|No| F["Recursive Character<br/>Splitter"]
E -->|Yes| G["Late Chunking<br/>or Contextual Retrieval"]
E -->|No| H["Semantic Chunking"]
C --> I{"Need precise search<br/>+ rich LLM context?"}
I -->|Yes| J["Add Parent-Child<br/>retrieval"]
I -->|No| K["Done ✓"]
style A fill:#4a90d9,color:#fff,stroke:#333
style F fill:#27ae60,color:#fff,stroke:#333
style G fill:#9b59b6,color:#fff,stroke:#333
style H fill:#e67e22,color:#fff,stroke:#333
style J fill:#e74c3c,color:#fff,stroke:#333
style K fill:#1abc9c,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
Things to Avoid
- Don’t default to large chunks with heavy overlap — OpenAI Assistants’ default of 800 tokens / 400 overlap scored worst in benchmarks
- Don’t ignore your separator list — the default ["\n\n", "\n", " ", ""] produces inconsistent chunks; add punctuation separators
- Don’t assume one strategy fits all — mix strategies per document type in your pipeline
- Don’t skip evaluation — always measure chunking impact on your actual queries
Evaluating Your Chunking Strategy
Token-Level Metrics
Following Chroma’s research, evaluate chunking with token-level metrics instead of document-level:
- Recall — What fraction of relevant tokens were retrieved?
- Precision — What fraction of retrieved tokens were relevant?
- IoU (Intersection over Union) — How well do retrieved chunks overlap with relevant excerpts?
\text{IoU} = \frac{|t_e \cap t_r|}{|t_e| + |t_r| - |t_e \cap t_r|}
where t_e is the set of relevant excerpt tokens and t_r is the set of retrieved tokens.
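On token sets, the formula is a one-liner (made-up tokens for illustration):

```python
# Worked example of token-level IoU
relevant = {"attention", "weighs", "every", "token"}        # excerpt tokens t_e
retrieved = {"attention", "weighs", "every", "token",
             "the", "decoder", "stack"}                     # retrieved tokens t_r

inter = len(relevant & retrieved)               # 4
union = len(relevant) + len(retrieved) - inter  # 4 + 7 - 4 = 7
iou = inter / union
print(round(iou, 3))  # 0.571
```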
Quick Evaluation Setup
def evaluate_chunking(chunks, queries_with_excerpts, embed_model, top_k=5):
    """Evaluate a chunking strategy on a set of queries and known excerpts."""
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
    chunk_embeddings = embed_model.embed_documents(chunks)
    results = {"recall": [], "precision": [], "iou": []}
    for query, excerpt_tokens in queries_with_excerpts:
        query_emb = embed_model.embed_query(query)
        sims = cosine_similarity([query_emb], chunk_embeddings)[0]
        top_indices = np.argsort(sims)[-top_k:]
        retrieved_tokens = set()
        for idx in top_indices:
            retrieved_tokens.update(chunks[idx].split())
        relevant = set(excerpt_tokens)
        intersection = relevant & retrieved_tokens
        recall = len(intersection) / len(relevant) if relevant else 0
        precision = len(intersection) / len(retrieved_tokens) if retrieved_tokens else 0
        union = len(relevant) + len(retrieved_tokens) - len(intersection)
        iou = len(intersection) / union if union else 0
        results["recall"].append(recall)
        results["precision"].append(precision)
        results["iou"].append(iou)
    return {k: np.mean(v) for k, v in results.items()}
Conclusion
Chunking is not a solved problem — it’s a design decision that depends on your documents, your embedding model, your queries, and your budget. The landscape spans from zero-cost heuristics to expensive LLM-powered approaches, and the right choice depends on your constraints.
Key takeaways:
- Start with RecursiveCharacterTextSplitter at ~200 tokens, no overlap — it’s the best cost-performance default
- Use document-aware splitting when your documents have clear structure
- Semantic chunking pays off for topic-dense, unstructured text
- Parent-child retrieval solves the precision-vs-context dilemma without changing how you chunk
- Late chunking is the most principled approach for preserving cross-chunk context, but requires compatible embedding models
- Agentic and contextual approaches deliver the highest quality but at significant cost
- Always evaluate — use token-level metrics (IoU, recall, precision) on your real queries
The best RAG systems often combine multiple strategies: document-aware splitting with recursive fallback, parent-child retrieval for context, and contextual headers for disambiguation. Start simple, measure, and iterate.
Read More
- Pair your chunking strategy with the right embedding model and reranker for maximum retrieval quality.
- Measure the impact of your chunking choices with RAG evaluation metrics like context recall and faithfulness.
- Explore GraphRAG for documents where entity relationships matter more than semantic similarity.
- Build an agentic RAG system that dynamically selects the best retrieval strategy per query.